ThesWB: A Tool for Thesaurus Construction from HTML Documents

Authors

  • Yousef Abuzir
  • Fernand Vandamme
Abstract

The number of electronically available documents on the Web is growing at an ever-increasing rate. Many Web documents contain rich knowledge that describes the document's content. The Web can be viewed as a body of text containing two fundamentally different types of data: the contents and the tags. In HTML (HyperText Markup Language), a tag is metadata that describes the layout of the text and the linking structure between texts. To these kinds of documents we can apply different approaches to extract and structure terms automatically. These approaches are based on a statistical model and on syntactic analysis that describe the data of interest, including relationships and context keywords. In this paper, we discuss an approach to extracting and structuring terms from documents posted on the Web in order to construct a thesaurus. The proposed tool, ThesWB, is used to construct a domain-independent thesaurus from HTML pages. ThesWB captures the internal structure of meta information embedded in HTML documents.
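The abstract does not spell out how ThesWB reads HTML structure, so the following is only a minimal sketch of the general idea, under one stated assumption: that the nesting of heading tags (`h1` to `h6`) hints at broader/narrower term relations. The names `HeadingTermParser`, `candidate_relations`, and `SAMPLE_HTML` are hypothetical and do not come from the paper.

```python
from html.parser import HTMLParser

# Assumption: heading depth is used as a proxy for term hierarchy.
HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


class HeadingTermParser(HTMLParser):
    """Collects (level, text) pairs for heading tags in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None   # level of the heading currently open, if any
        self._buffer = []    # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_LEVELS:
            self._level = HEADING_LEVELS[tag]
            self._buffer = []

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_LEVELS and self._level is not None:
            # Collapse internal whitespace into single spaces.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.headings.append((self._level, text))
            self._level = None


def candidate_relations(html):
    """Return (broader, narrower) candidate pairs implied by heading nesting."""
    parser = HeadingTermParser()
    parser.feed(html)
    parser.close()
    stack = []   # currently open ancestors as (level, term)
    pairs = []
    for level, term in parser.headings:
        # Drop headings at the same or a deeper level: they are not ancestors.
        while stack and stack[-1][0] >= level:
            stack.pop()
        if stack:
            pairs.append((stack[-1][1], term))
        stack.append((level, term))
    return pairs


SAMPLE_HTML = """
<html><body>
<h1>Databases</h1>
<h2>Relational databases</h2>
<h3>SQL</h3>
<h2>Object databases</h2>
</body></html>
"""

if __name__ == "__main__":
    for broader, narrower in candidate_relations(SAMPLE_HTML):
        print(f"{broader} -> {narrower}")
```

Pairs produced this way are only candidates. A statistical filter or syntactic analysis of the kind the abstract mentions would still be needed to decide which of them belong in a thesaurus.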


Similar resources

A semi-automatic indexing system based on embedded information in HTML documents

Purpose – This paper describes and evaluates the tool DigiDoc MetaEdit which allows the semi-automatic indexing of HTML documents. The tool works by identifying and suggesting keywords from a thesaurus according to the embedded information in HTML documents. This enables the parameterization of keyword assignment based on how frequently the terms appear in the document, the relevance of their p...


Detecting Similar HTML Documents Using a Sentence-Based Copy Detection Approach

Rajiv Yerra, Department of Computer Science, Master of Science. Web documents that are either partially or completely duplicated in content are easily found on the Internet these days. Not only do these documents create redundant information on the Web, which makes it take longer to filter out unique information, and cause additional s...


Use of Keyphrase Extraction Software for Creation of an AEC/FM Thesaurus

The paper describes a method used to collect terms needed for the development of a thesaurus in the roofing domain. This work is part of a larger effort to investigate the potential of thesauri as an aid in product modeling and as a tool for information management in model-based systems. Extractor, a software module that extracts keyphrases from documents, was used for collecting candidate thes...


Problems of Thesaurus Construction in Iran from the Viewpoint of Thesaurus Compilers

Introduction: The present research studies the theoretical foundations of thesaurus construction before and after the Internet and identifies the problems of thesaurus construction in Iran from the point of view of thesaurus makers and translators of the published thesauri. Methods: The research population was 6 thesaurus makers (AbdolHossein Azaragn, Abbas Hori, Fatemeh Rahadoost, Faribor...


CITOM: An Incremental Construction of Topic Maps

This paper proposes the CITOM approach for the incremental construction of multilingual Topic Maps. Our main goal is to facilitate users' navigation across documents available in different languages. Our approach takes into account three types of information sources: (a) a set of multilingual documents, (b) a domain thesaurus and (c) all the possible questioning sources such as FAQ and user’s or...



Publication date: 2001